Note: Any extension material won’t be assessed in the discipline quiz but may be of use for students who select this project as their interdisciplinary project.
Kidney transplantation is often the treatment of choice for people with end-stage kidney disease. However, despite advances in the field of transplantation, the proportion of patients who develop graft rejection after a kidney transplant remains high. Understanding the characteristics between stable patients and patients who experience rejection can be one of the ways to improve health outcomes for these people.
In this exercise we will visualise public datasets on kidney transplant patients. These are RNA-sequencing datasets that measure the gene expression of patients’ cells. We will then use the gene expression to predict the patients’ outcome (i.e., stable versus rejection). The public datasets are all taken from the Gene Expression Omnibus database. You can download the dataset by looking up its GSE ID in the database. Information about each dataset, for example, the paper that this data is published in, the experimental protocol, can also be found on the database. We will first focus on how to analyse the data from “GSE46474”. This dataset contains the gene expression profiles of 40 blood samples. Of those, 20 patients developed graft rejection and 20 had stable grafts and will be treated as controls. Note, you can go to the package website for installation details.
library(GEOquery)
library(R.utils)
library(reshape2)
library(ggplot2)
library(limma)
library(dplyr)
## gse <- getGEO("GSE46474")
##gse <- gse$GSE46474_series_matrix.txt.gz
load("data/GSE46474.RData")
gse
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54613 features, 40 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM1130812 GSM1130813 ... GSM1130851 (40 total)
## varLabels: title geo_accession ... tissue:ch1 (46 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... NA.25224 (54613 total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 25387159
## Annotation: GPL570
The actual expression matrix can be extracted using the
exprs function. Data about the phenotype of each patient
can be extracted using pData function. Similarly, we can
extract the featureData object using the fData
function.
head(pData(gse)[,1:5])
head(fData(gse)[,1:5])
head(exprs(gse)[,1:5])
eMat = exprs(gse)
gse$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable") #Tidy the title variable and call it Outcome.
table(gse$Outcome)
##
## Rejection Stable
## 20 20
## head(pData(gse)[,1:5])
## head(fData(gse)[,1:5])
## head(exprs(gse)[,1:5])
eMat = exprs(gse)
(a) What is the number of samples in the
Rejection class and the Stable class?
Suggestions:
gse$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable") #Tidy the title variable and call it Outcome.
table(gse$Outcome)
##
## Rejection Stable
## 20 20
(b) What is the size of this data matrix?
Suggestions:
dim(gse)
## Features Samples
## 54613 40
dim(eMat)
## [1] 54613 40
It has 54613 rows and 40 columns, meaning for each of the 40 patients, there are gene expression measurements across 54613 genes.
(c) [Exploration - optional] The
names of these features aren’t “genes” but the ID generated by the
biotechnology company that created this platform. In this case, the name
of the manufacture is call Affymetrix and the name of their
platform is known as
Affymetrix Human Genome U133 Plus 2.0 Array. This platform
aims to provide the complete coverage of the Human Genome for analysis.
Details of the platform can be found at the company’s
website. Here, the information gene symbols are stored
in the column Gene Symbol and the description of the genes
are stored in the column named Gene Title.
rownames(gse)[1:10]
## [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at"
## [7] "1316_at" "1320_at" "1405_i_at" "1431_at"
print(rownames(eMat[1:50, ]))
## [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at"
## [6] "1294_at" "1316_at" "1320_at" "1405_i_at" "1431_at"
## [11] "1438_at" "1487_at" "1494_f_at" "1552256_a_at" "1552257_a_at"
## [16] "1552258_at" "1552261_at" "1552263_at" "1552264_a_at" "1552266_at"
## [21] "1552269_at" "1552271_at" "1552272_a_at" "1552274_at" "1552275_s_at"
## [26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"
## [31] "1552281_at" "1552283_s_at" "1552286_at" "1552287_s_at" "1552288_at"
## [36] "1552289_a_at" "1552291_at" "1552293_at" "1552295_a_at" "1552296_at"
## [41] "1552299_at" "1552301_a_at" "1552302_at" "1552303_a_at" "1552304_at"
## [46] "1552306_at" "1552307_a_at" "1552309_a_at" "1552310_at" "1552311_a_at"
featureData <- fData(gse) # GSE contain a featuredata slot that contains information about the genes that were sampled.
colnames(featureData)
## [1] "ID" "GB_ACC"
## [3] "SPOT_ID" "Species Scientific Name"
## [5] "Annotation Date" "Sequence Type"
## [7] "Sequence Source" "Target Description"
## [9] "Representative Public ID" "Gene Title"
## [11] "Gene Symbol" "ENTREZ_GENE_ID"
## [13] "RefSeq Transcript ID" "Gene Ontology Biological Process"
## [15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
head(featureData[,"Gene Symbol"])
## [1] "DDR1 /// MIR4640" "RFC2" "HSPA6" "PAX8"
## [5] "GUCA1A" "MIR5193 /// UBA7"
(a) Log transformation is a standard normalisation technique for RNA-seq data. Has our dataset has been log2 transformed? How can we determine this from looking at the data?
Suggestions:
summary(eMat[,1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.683 3.357 4.564 4.758 5.834 14.988
(b) Save a copy of this processed dataset for future
usage using write.csv.
write.csv(exprs(gse), "data/GSE46474_expression_matrix.txt")
It is critical that we examine the dataset. Create a series of boxplots to compare the sample distribution among different people. Are there any samples that appear to be significantly different from the others? Is it necessary to remove any samples (patients)?
[Discussion] Can you think of any other graphical techniques to visualise or analyse the data?
Suggestions:
library(reshape2)
## base R code is
## boxplot(eMat) or boxplot(melt(exprs(gse)))
p <- ggplot(melt(exprs(gse)), aes(x=Var2, y=value)) +
geom_boxplot(outlier.colour="black", outlier.shape=16,outlier.size=0.5, notch=FALSE) +
theme(axis.text.x = element_text(angle = 180, hjust = 1)) +
labs (x = "patient", y = "expression value") + theme_minimal()
p
summary(melt(exprs(gse))$value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.677 3.279 4.488 4.755 5.863 15.045
Let us select row 8251 and have a look. This represents the gene expression from a selected gene. Calculate the two sample t-statistic which could be used to test the null hypothesis, that the mean log(expression) for rejection individuals is the same as that for stable individuals. Assume that this t-statistic does indeed have a t-distribution; obtain the p-value for testing this hypothesis. Check for normality of the data using a qqplot.
Suggestions:
eMat[8251,]
## GSM1130812 GSM1130813 GSM1130814 GSM1130815 GSM1130816 GSM1130817 GSM1130818
## 8.634013 8.588576 9.420062 8.333498 9.635923 8.039824 8.824033
## GSM1130819 GSM1130820 GSM1130821 GSM1130822 GSM1130823 GSM1130824 GSM1130825
## 8.448080 8.578168 7.771895 8.159828 8.693350 8.835891 8.883193
## GSM1130826 GSM1130827 GSM1130828 GSM1130829 GSM1130830 GSM1130831 GSM1130832
## 9.010599 6.970322 10.048063 8.561478 8.824975 6.111392 9.425899
## GSM1130833 GSM1130834 GSM1130835 GSM1130836 GSM1130837 GSM1130838 GSM1130839
## 7.175468 8.835151 7.914015 10.413707 8.845688 8.146591 7.256471
## GSM1130840 GSM1130841 GSM1130842 GSM1130843 GSM1130844 GSM1130845 GSM1130846
## 8.467151 7.014702 9.046717 7.729293 8.965878 8.498529 9.658861
## GSM1130847 GSM1130848 GSM1130849 GSM1130850 GSM1130851
## 8.868084 8.849542 7.595218 8.984595 7.941705
boxplot(eMat[8251,] ~ gse$Outcome, ylab="log2 expression")
t.test(eMat[8251,] ~ gse$Outcome)
##
## Welch Two Sample t-test
##
## data: eMat[8251, ] by gse$Outcome
## t = 5.0164, df = 35.523, p-value = 1.474e-05
## alternative hypothesis: true difference in means between group Rejection and group Stable is not equal to 0
## 95 percent confidence interval:
## 0.6409244 1.5115623
## sample estimates:
## mean in group Rejection mean in group Stable
## 9.038282 7.962039
What are the highly differentially expressed genes between stable and
rejection patients? This is the same question as which genes have the
most different expression between stable and rejection patients. We can
perform a series of t-tests, and the bioinformatics community has
developed a package that speeds up this process. We will examine a
package called limma which fits a series of linear models
\[ Y = X\beta\] where \(X\) is known as the design
matrix. The theory and methods behind all this are covered in STAT3022 -
Applied linear models.
design <- model.matrix(~Outcome, data = pData(gse))
fit <- lmFit(exprs(gse), design)
fit <- eBayes(fit)
## Without gene symbols
library(DT)
tT <- topTable(fit, n = Inf)
DT::datatable(round(tT[1:100, ], 2))
## Adding gene symbols
tT <- topTable(fit, genelist=featureData[,"Gene Symbol"], n = 20)
Explore: Run the following code and explore the impact of multiple testing. Convince yourself via simulation.
cl <- factor(sample(c("YES", "NO"), 80, replace=TRUE))
fakeX <- matrix(rnorm(10000*80), nrow=10000)
design <- model.matrix(~ cl + 0 )
fakefit <- lmFit(fakeX, design)
cont.matrix <- makeContrasts(clYES - clNO, levels=design)
fakefit2 <- contrasts.fit(fakefit, cont.matrix)
fakefit2 <- eBayes(fakefit2)
Suggestions:
cl <- factor(sample(c("YES", "NO"), 80, replace=TRUE))
fakeX <- matrix(rnorm(10000*80), nrow=10000)
design <- model.matrix(~ cl + 0 )
fakefit <- lmFit(fakeX, design)
cont.matrix <- makeContrasts(clYES - clNO, levels=design)
fakefit2 <- contrasts.fit(fakefit, cont.matrix)
fakefit2 <- eBayes(fakefit2)
DT::datatable(round(topTable(fakefit2), 2))
qqnorm(fakefit2$t)
abline(0,1)
We will examine our analytical results using two types of scatter plots
logFC (y-axis) vs
AveExpr (x-axis) where M stands for Minus and A stands for
Add.-log(P.Value) (y-axis)
vs logFC (x-axis).## Code for MA plot
df<- topTable(fit, number=nrow(fit), genelist=rownames(gse))
ggplot(df, aes(x = AveExpr, y = logFC))+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
ylab("log2 fold change") + xlab("Average expression")
## Code to extract negative log p-value
p.value = -log10(df$P.Value))
Suggestions:
df<- topTable(fit, number=nrow(fit), genelist=rownames(gse))
p <- ggplot(df, aes(x = AveExpr, y = logFC))+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
ylab("log2 fold change") + xlab("Average expression")
p
p <- ggplot(df, aes(logFC,-log10(P.Value)))+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
xlab("log2 fold change") + ylab("-log10 p-value")
p
It would be nice if we can visualise our data with a 2D scatter-plot. However, for each patient, there are more than 10000 variables. How are we going to visualise this information for each patient? Similar to Lab 1, we can use a dimension reduction technique called PCA. PCA helps to reduce or summarise the information of a high dimensional object into a lower dimension. We can then use the first and second dimension, as the summary information of the >10000 genes, to plot a 2D scatter-plot. The PCA plot frequently gives us a good indication of how simple or difficult the relevant prediction task is. Please submit this plot for formative feedback and comment (1 sentence only) on what you learn from looking at this graph. Take care to make sure the graphic is properly labelled.
Suggestions:
gse_pca <- prcomp(t(exprs(gse)))
df_toplot <- data.frame(gse$Outcome,
pc1 = gse_pca$x[,1], pc2 = gse_pca$x[,2] )
g <- ggplot(df_toplot, aes(x = pc1, y = pc2, color = gse.Outcome)) +
geom_point(size = 4) +
theme_minimal()
g
From the scatter plot, each patient is represented by a data point, coloured by their rejection status. We can see that the rejection patients cannot be easily distinguished from stable patients. You might consider some filtering strategies before PCA.
Let’s apply the machine learning models we have learnt in the previous week to predict patient rejection status from gene expression data. One of the key challenges of building a classification model for biomedical data is facing the large p small n challenge. This means the number of variables (\(p\)) is often a lot more than the number of samples (\(n\)) which creates a lot of challenges, we address this in a ad hoc way by selecting a subset of genes (features). The following code will help you get started.
## quick filter to reduce computational time
largevar = apply(gse, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = as.matrix(t(exprs(gse[ind,])))
y = gse$Outcome
cvK = 5 # number of CV folds
cv_50acc5_knn = cv_50acc5_svm = cv_50acc5_rf = c()
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
n_sim = 25 ## number of repeats
for (i in 1:n_sim) {
cvSets = <write your own code for KNN> # permute all the data, into 5 folds
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## KNN
fit5 = <write your own code for KNN>
cv_acc_knn[j] = <write your own code for KNN>
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm[j] = <write your own code for KNN>
## RandomForest
rf_res <- randomForest::randomForest(x = X_train, y = as.factor(y_train))
fit <- predict(rf_res, X_test)
cv_acc_rf[j] = <write your own code for KNN>
}
cv_50acc5_knn <- append(cv_50acc5_knn, mean(cv_acc_knn))
cv_50acc5_svm <- <write your own code for KNN>
cv_50acc5_rf <- <write your own code for KNN>
} ## end for
Suggestions:
## quick filter to reduce computational time
largevar = apply(gse, 1, var)
ind = which(largevar > quantile(largevar, 0.9))
X = as.matrix(t(exprs(gse[ind,])))
y = gse$Outcome
cvK = 5 # number of CV folds
cv_50acc5_knn = cv_50acc5_svm = cv_50acc5_rf = c()
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
n_sim = 25 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_knn = cv_acc_svm = cv_acc_rf = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## KNN
fit5 = class::knn(train = X_train, test = X_test, cl = y_train, k = 5)
cv_acc_knn[j] = mean(fit5 == y_test)
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm[j] = mean(fit == y_test)
## RandomForest
rf_res <- randomForest::randomForest(x = X_train, y = as.factor(y_train))
fit <- predict(rf_res, X_test)
cv_acc_rf[j] = mean(fit == y_test)
}
cv_50acc5_knn <- append(cv_50acc5_knn, mean(cv_acc_knn))
cv_50acc5_svm <- append(cv_50acc5_svm, mean(cv_acc_svm))
cv_50acc5_rf <- append(cv_50acc5_rf, mean(cv_acc_rf))
} ## end for
Visualise the comparison
boxplot(list(SVM = cv_50acc5_svm, KNN = cv_50acc5_knn , RF= cv_50acc5_rf ), ylab="CV Accuracy")
One of the assignment 1 questions will be based on the following task.
Blood vs Biopsy Biomarker.
We estimated the accuracy for our predictive model in graft rejection from a peripheral blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself. Here we will look at a similar kidney microarray dataset. But instead of genes being isolated and sequenced from blood, they have been sequenced from a kidney biopsy. How do these differ when compared to the peripheral blood models we generated earlier? Can you make a informative plot?
[Optional] What reasons can you think of that account for this difference in model accuracy?
The code below illustrate how to download another dataset. We will
also make GSE138043.RData available.
gse <- getGEO("GSE138043")
gse <- gse$GSE138043_series_matrix.txt.gz
gse <- load("GSE138043.RData")
library(preprocessCore)
exprs(gse) <- normalize.quantiles(exprs(gse)) # Normalise Data
boxplot(normalize.quantiles(exprs(gse)))
This section is optional as a head-start to the next
case study which looks at data used to detect eye movement. A sequence
of data from the Spiker box was created by two physics tutors. These
signals are recorded and saved as WAV files. We’ll start our
investigation by looking at the data in the file
LRL L3.wav.
Data loading: With the code below, we’ll read the
.wav file using the tuneR package’s
readWave function. We use two functions unique to the S4
class to retrieve information:
- The function slotNames returns information about the
components (termed slots) for a given object.
- To extract a specific slots of an S4 object, you use @
such as waveSeq@left. If you’re curious, you may learn all
about the S4 object framework here, however it’s not needed
for the course.
Explanations of the slotNames:
- left: Recordings (as sample point) in the left
channel.
- right: Recordings (as sample point) in the right
channel.
- stereo: Whether the sounds are recorded in stereo.
- samp.rate: The number of samples per second. If the
sampling rate is 4000 hertz, a recording with a duration of 5 seconds
will contain 20,000 samples.
- bit: the number of bits in each sample, and can be
considered as a form of resolution.
- pcm: stands for Pulse Code Modulation, is a method to
capture audio waveforms digitally.
## install.packages(tuneR)
library(tidyverse)
library(tuneR)
library(ggplot2)
waveSeq <- readWave("data/LRL_L3.wav")
waveSeq
##
## Wave Object
## Number of Samples: 72688
## Duration (seconds): 7.27
## Samplingrate (Hertz): 10000
## Channels (Mono/Stereo): Mono
## PCM (integer format): TRUE
## Bit (8/16/24/32/64): 16
slotNames(waveSeq)
## [1] "left" "right" "stereo" "samp.rate" "bit" "pcm"
# time (in second) of the sequencing
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
Quick visualization: From the code above, we see that there are 72688 values and we have 10000 observations per second. The code below provide a simple way visualize the data using the R base function. The three eye movements in this file are displayed in the title of the file “LRL” denoting a left eye movement followed by a right and left eye movement. Visually can you distinguish the pattern from left eye movement versus right eye movement?
plot(timeSeq, waveSeq@left, type = "l", ylab="Signal", xlab="Time(seconds)")
ggPlot visualization: In this data, we denote any
eye movement as an event, so across the \(x\)-axis, we have values identified as part
of a left/right eye movement (event) or no eye movement (no event). From
the figure, we observed that the three events (left, right, left) that
occur at event times are 1.8, 3.5 and 5 seconds, respectively. The
following code used ggplot to graph the data and highlight
an interval of 1 second over the three events. Take a look to see if you
can understand the R code, and we will go over it further in the Week 2
lecture and Week 3 lab.
## Define the table by eye
eventTable <- data.frame(eventTime = c(1.8, 3.5, 5),
eventType= c("Left", "Right", "Left"))
## Construct a data frame (df)
df <- data.frame(Y = waveSeq@left,
time = timeSeq,
event_type = "none",
event_time = NA,
event_pos = NA)
## Check class type
sapply(df, class)
## Y time event_type event_time event_pos
## "integer" "numeric" "character" "logical" "logical"
df$event_type = as.character(df$event_type)
## Create highlights
for (i in 1:nrow(eventTable)) {
t_idx <- (df$time < (eventTable[i, 1] + 0.5)) & (df$time > (eventTable[i, 1] - 0.5))
df$event_type[t_idx] <- as.character(eventTable[i, 2])
df$event_time[t_idx] <- seq_len(sum(t_idx))
df$event_pos[t_idx] <- i
}
df$event_type <- factor(df$event_type, levels = c("Right", "Left", "none"))
ggplot(df, aes(x = time, y = Y, col = event_type, group = 1)) + geom_line() +
scale_color_manual(values = c("#E41A1C", "#377EB8", "black")) +
theme_bw()